The TDIL Program and the Indian Language Corpora Initiative (ILCI)
Abstract
India is considered a linguistic ocean, with 4 language families, 22 scheduled national languages, and 100 un-scheduled languages reported by the 2001 census. This puts tremendous pressure on the Indian government not only to have comprehensive language policies, but also to create resources for the maintenance and development of these languages. In the age of information technology, there is a greater need to balance the allocation of resources to each language, keeping in view political compulsions, the electoral potential of a linguistic community, and other issues. In this connection, the government of India, through various ministries and a think tank consisting of eminent linguists and policy makers, has done a commendable job despite the obvious roadblocks. This paper describes the Indian government’s policies towards language development and maintenance in the age of technology, implemented through the Ministry of HRD and its various agencies and through the Ministry of Communications & Information Technology (MCIT) and its dedicated program called TDIL (Technology Development for Indian Languages). The paper also describes some of the recent activities of the TDIL in general and, in particular, an innovative corpora project called ILCI (Indian Languages Corpora Initiative).

1. The linguistic scene in India

India has a very complex and peculiar linguistic situation. There are 4 language families (Baldridge 96): Indo-Aryan (76.87% of speakers), Dravidian (20.82%), Austro-Asiatic (1.11%), and Tibeto-Burman (1%). These include 22 constitutionally recognized (scheduled) languages, of which Hindi has the ‘official’ status in addition to the ‘national’ status. English, which is not a national language of India, has the status of ‘associate official’ language. Besides these, India has 100 mother tongues reported by the recent census (2001), and many more (running up to 1000) documented languages and dialects.
A new language family called ‘Andamanese’ has recently been discovered (Abbi 2001), and the existence of another, a 6th family called ‘Great Andamanese’, is very likely. Of the major Indian languages, Hindi is spoken in 10 states of India with a total population of over 45%, followed by Telugu and Bangla. Beyond the languages themselves, there is also a multitude of scripts: India has more than 18 scripts, which need to be standardized and supported by technology.

2. Constitutional provisions and the language policy of India

India currently has 25 states and 7 union territories (UTs). The Indian constitution, adopted in 1950, originally listed 14 languages as scheduled languages of the union. The schedule called THE EIGHTH SCHEDULE (Articles 344 (1) and 351) now lists the following 22 languages:

• Assamese
• Bengali
• Gujarati
• Hindi
• Kannada
• Kashmiri
• Konkani
• Malayalam
• Manipuri
• Marathi
• Nepali
• Oriya
• Punjabi
• Sanskrit
• Sindhi
• Tamil
• Telugu
• Urdu
• Maithili
• Bodo
• Santhali
• Dogri

The Indian states and UTs have exclusive rights over their regional languages. However, if the language of a state is also listed as a scheduled language (as above), then the union has a constitutional obligation to promote the language. Hindi, besides being a scheduled (national) language along with 21 other languages, is also the official language of the union, with English as its associate. Hindi is the official language of 10 out of 25 Indian states and is spoken by more than 42% of the Indian population. Each of the 25 states and 7 UTs in India can have its own official language (one of the 22 listed above) and several other minor languages, the speakers of which have a fundamental right to maintain and promote these languages. The
Similar resources
Generating Translation Corpora in Indic Languages: Cultivating Bilingual Texts for Cross Lingual Fertilization
We address some theoretical and practical issues relating to generation, processing, and management of Translation Corpus (TC) in Indian languages, which is developed in a consortium-mode project (ILCI-II) under the DeitY, Govt. of India. Issues are discussed here for the first time keeping in mind the ready application of TC in various domains of computational and applied linguistics. We fir...
BIS Annotation Standards With Reference to Konkani Language
The Bureau of Indian Standards (BIS) Part Of Speech (POS) tagset has been prepared for the Indian Languages by the POS Tag Standardization Committee of Department of Information Technology (DIT), New Delhi, India. The BIS POS tagset aims to ensure standardization in the POS tagging of all the Indian Languages. It has been used for POS tagging in the Indian Languages Corpora Initiative (ILCI) pr...
Creating Multilingual Parallel Corpora in Indian Languages
This paper presents a description of the parallel corpora being created simultaneously in 12 major Indian languages including English under a nationally funded project named Indian Languages Corpora Initiative (ILCI) run through a consortium of institutions across India. The project runs in two phases. The first phase of the project has two distinct goals creating parallel sentence aligned corp...
Nlp Challenges for Machine Translation from English to Indian Languages
This Natural Language processing is carried particularly on English-Kannada/Telugu. Kannada is a language of India. The Kannada language has a classification of Dravidian, Southern, Tamil-Kannada, and Kannada. Regions Spoken: Kannada is also spoken in Karnataka, Andhra Pradesh, Tamil Nadu, and Maharashtra. Population: The total population of people who speak Kannada is 35,346,000, as of 1997. A...
MTIL17: English to Indian Langauge Statistical Machine Translation
English to Indian language machine translation poses the challenge of structural and morphological divergence. This paper describes English to Indian language statistical machine translation using pre-ordering and suffix separation. The pre-ordering uses rules to transfer the structure of the source sentences prior to training and translation. This syntactic restructuring helps statistical mach...
Publication date: 2010